# Explore & Summarize Data Using R, by Rawan Alghamdi

Introduction

This project explores and analyses a dataset on red wine quality using R, a statistical programming language. The dataset contains 1599 observations of 13 variables.

Univariate Plots Section

This is the summary of the dataset

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This shows the data type and the first few values of each variable in the dataset.

The dataset consists of 1599 observations of 13 variables.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
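
As a minimal sketch of how the two outputs above are typically produced, the snippet below rebuilds the first ten rows of a few columns from the `str()` listing and calls `str()` and `summary()` on them. The name `wine_head` is hypothetical, and the ten-row subset is only for illustration; the actual analysis works on the full `wineQualityReds` data frame.

```r
# First ten values of selected columns, copied from the str() output above.
# `wine_head` is a hypothetical name for this small illustrative subset.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, median, mean and max per column
```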

This section shows a plot for each variable in the dataset. A description of the shape, center and spread of each plot (histogram) is given below it.

fixed.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed.acidity is right-skewed. The range of the data is 11.3. The main peak is at approximately 7. There is a small gap between 15 and 15.5. The middle half of the observations falls between 7.10 and 9.20.
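
The figures quoted in these descriptions can be derived directly from the summary output. The sketch below does this for fixed.acidity, assuming only the six summary values printed above; the variable names are hypothetical. The interval from the first to the third quartile contains the middle 50% of the observations, and a mean above the median is consistent with right skew.

```r
# Summary values for fixed.acidity, copied from the output above.
fixed_acidity_summary <- c(min = 4.60, q1 = 7.10, median = 7.90,
                           mean = 8.32, q3 = 9.20, max = 15.90)

# Range = max - min; interquartile range = Q3 - Q1.
data_range <- unname(fixed_acidity_summary["max"] - fixed_acidity_summary["min"])
iqr <- unname(fixed_acidity_summary["q3"] - fixed_acidity_summary["q1"])

# A mean above the median is consistent with a right-skewed distribution.
mean_above_median <- unname(
  fixed_acidity_summary["mean"] > fixed_acidity_summary["median"]
)

print(c(range = data_range, iqr = iqr))
print(mean_above_median)
```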

volatile.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile.acidity is right-skewed with a short tail on the right. The range of the data is 1.46. There are two peaks at approximately 0.4 and 0.6, so the plot appears bimodal. The middle half of the observations falls between 0.39 and 0.64. There is a gap after approximately 1.3, so there are outliers at the right end of the plot.

citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distribution of citric.acid is right-skewed. The range of the data is 1. The peak is at 0, which means that zero is the most common citric acid value among these red wines. The middle half of the observations falls between 0.090 and 0.420. There are outliers at the right end of the plot.

residual.sugar

# Log10-transform the x axis, since the untransformed plot has a long-tailed distribution
ggplot(wineQualityReds, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 0.03) +
  scale_x_log10()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual.sugar is right-skewed with a long tail on the right and some gaps. The range of the data is 14.6. The peak is at 2. The middle half of the observations falls between 1.900 and 2.600, and there are many small bars at the right end of the plot.
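
When `scale_x_log10()` is applied, ggplot2 transforms the data before binning, so `binwidth = 0.03` is measured in log10 units rather than in the variable's own units. The sketch below, which uses hypothetical variable names and only the minimum and maximum from the summary above, shows what that implies: every bin spans the same ratio of values, and roughly how many bins are needed to cover the data.

```r
# binwidth is in log10 units when scale_x_log10() is applied, so each bin
# covers a constant ratio of values rather than a constant difference.
binwidth_log10 <- 0.03
bin_ratio <- 10^binwidth_log10  # upper edge divided by lower edge of a bin

# Approximate number of bins needed to cover residual.sugar from its
# minimum (0.9) to its maximum (15.5), as reported in the summary.
n_bins <- ceiling((log10(15.5) - log10(0.9)) / binwidth_log10)

print(bin_ratio)
print(n_bins)
```

The exact bin count in the plot can differ slightly, because ggplot2 also chooses where the bin edges are aligned.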

chlorides

# Log10-transform the x axis, since the untransformed plot has a long-tailed distribution
ggplot(wineQualityReds, aes(x = chlorides)) +
  geom_histogram(binwidth = 0.03) +
  scale_x_log10()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution of chlorides is right-skewed with a long tail on the right and some gaps, although it looks normal on the left side around the peak. The range of the data is 0.599. The median is 0.07900. The middle half of the observations falls between 0.07000 and 0.09000, and there are many small bars at the right end of the plot.

free.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free.sulfur.dioxide is right-skewed with a gap on the right around 60. The range of the data is 71. The median is 14. The middle half of the observations falls between 7 and 21.

total.sulfur.dioxide

# Log10-transform the x axis, since the untransformed plot has a long-tailed distribution
ggplot(wineQualityReds, aes(x = total.sulfur.dioxide)) +
  geom_histogram(binwidth = 0.03) +
  scale_x_log10()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The distribution of total.sulfur.dioxide is right-skewed with a short tail and a gap between approximately 170 and 280. The range of the data is 283. The median is 38. The middle half of the observations falls between 22 and 62.

density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The distribution of density is almost symmetric. The range of the data is 0.0136. The median is 0.9968. The middle half of the observations falls between 0.9956 and 0.9978.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The distribution of pH is almost symmetric. It could be bimodal because there are two peaks close together. The range of the data is 1.27. The median is 3.310. The middle half of the observations falls between 3.210 and 3.400.

sulphates

# Log10-transform the x axis, since the untransformed plot has a long-tailed distribution
ggplot(wineQualityReds, aes(x = sulphates)) +
  geom_histogram(binwidth = 0.03) +
  scale_x_log10()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is right-skewed with a long tail and many gaps on the right, which suggests that there are some outliers. The range of the data is 1.67. The median is 0.6200. The middle half of the observations falls between 0.5500 and 0.7300.

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of alcohol is right-skewed with a gap on the right after 14. The peak is at 9.5. The range of the data is 6.5. The median is 10.20. The middle half of the observations falls between 9.50 and 11.10.

quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The range of the data is 5. The median is 6. Most red wines are of quality 5 or 6.
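
Because quality is an integer score, a frequency table can describe it more directly than a histogram. The sketch below shows the mechanics with `table()` and `prop.table()`, using only the first ten quality values from the `str()` output earlier in this section; `quality_head` is a hypothetical name, and ten rows are far too few to represent the full distribution.

```r
# First ten quality scores, copied from the str() output.
quality_head <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

quality_counts <- table(quality_head)  # number of wines per quality score
print(quality_counts)
print(prop.table(quality_counts))      # share of wines per quality score
```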

Univariate Analysis

What is the structure of the dataset?

There are 1599 observations of 13 variables (the index variable X is included even though it is not important to the analysis). Each observation represents one red wine sample.

What is/are the main feature(s) of interest in the dataset?

The main feature of interest in this dataset is the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think all features in the dataset can support the investigation in one way or another. Some may have a large effect and some a small one, but all of them can help make the analysis and investigation easier and more accurate.

Are there any new variables created from existing variables in the dataset?

No.

Of the features investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The dataset is already tidy, so there was no need to adjust it. The first variable, the index, is not important, but there was no need to remove it because keeping it does not harm the analysis. There were no unusual distributions. Most of the variables are right-skewed, and none are left-skewed.

Bivariate Plots Section

This section visualizes the relationships between some of the variables in the dataset. A description of each plot is given below it.

##                          X fixed.acidity volatile.acidity citric.acid
## X                     1.00         -0.27            -0.01       -0.15
## fixed.acidity        -0.27          1.00            -0.26        0.67
## volatile.acidity     -0.01         -0.26             1.00       -0.55
## citric.acid          -0.15          0.67            -0.55        1.00
## residual.sugar       -0.03          0.11             0.00        0.14
## chlorides            -0.12          0.09             0.06        0.20
## free.sulfur.dioxide   0.09         -0.15            -0.01       -0.06
## total.sulfur.dioxide -0.12         -0.11             0.08        0.04
## density              -0.37          0.67             0.02        0.36
## pH                    0.14         -0.68             0.23       -0.54
## sulphates            -0.13          0.18            -0.26        0.31
## alcohol               0.25         -0.06            -0.20        0.11
## quality               0.07          0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## X                             -0.03     -0.12                0.09
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## X                                   -0.12   -0.37  0.14     -0.13    0.25
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## X                       0.07
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

The strength of the relationship between two variables is determined by the size of the correlation coefficient. A correlation of 0 means that no linear relationship exists between the two variables, a correlation of 1 indicates a perfect positive relationship, and a correlation of -1 indicates a perfect negative relationship. Perfect relationships are uncommon in the real world, so a positive correlation between two variables will usually lie somewhere between 0 and 1.
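
To make the coefficient concrete, the sketch below computes Pearson's r two ways: from its definition (covariance divided by the product of the standard deviations) and with the built-in `cor()`. It uses only the first ten values of fixed.acidity and citric.acid from the `str()` output, so the result will not match the full-data value in the matrix. That the matrix itself was produced with `cor()` is an assumption, since the report does not show that call.

```r
# First ten values of two variables, copied from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
citric_acid   <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Pearson's r: covariance divided by the product of the standard deviations.
r_manual <- cov(fixed_acidity, citric_acid) /
  (sd(fixed_acidity) * sd(citric_acid))

# cor() uses method = "pearson" by default.
r_builtin <- cor(fixed_acidity, citric_acid)

print(round(c(manual = r_manual, builtin = r_builtin), 2))
```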

Observations from the correlation matrix above and the correlation plot:

1- There is no linear relationship between volatile.acidity and residual.sugar (r = 0.00).
2- Many pairs of variables have a negligible relationship, for example fixed.acidity and residual.sugar (r = 0.11).
3- The positive correlations with quality are negligible to weak, except for alcohol, which has the strongest positive relationship with quality (r = 0.48).
4- residual.sugar has the weakest positive relationship with quality (r = 0.01).
5- volatile.acidity has the strongest negative relationship with quality (r = -0.39); in general, the negative correlations with quality are negligible to moderate.
6- free.sulfur.dioxide has the weakest negative relationship with quality (r = -0.05).
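
The rankings in these observations can be checked programmatically. The sketch below copies the quality column of the correlation matrix (leaving out the index X and quality itself) into a named vector and picks out the strongest and weakest positive and negative correlations; the variable names are hypothetical.

```r
# Correlations with quality, copied from the matrix above (index X omitted).
cor_with_quality <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48
)

positive <- cor_with_quality[cor_with_quality > 0]
negative <- cor_with_quality[cor_with_quality < 0]

strongest_positive <- names(which.max(positive))
weakest_positive   <- names(which.min(positive))
strongest_negative <- names(which.min(negative))  # most negative value
weakest_negative   <- names(which.max(negative))  # closest to zero

print(sort(cor_with_quality, decreasing = TRUE))
```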

Here, the relationships between some variables are visualized using scatter plots.

ggplot(wineQualityReds, aes(x = fixed.acidity, y = citric.acid)) +
  geom_point(alpha = 1/2) +
  xlim(0.00, 14) +
  ylim(0.00, 0.75)

There appears to be a strong positive relationship between fixed.acidity and citric.acid (r = 0.67).
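
One detail of the scatter plot above is worth keeping in mind: in ggplot2, `xlim()` and `ylim()` drop points that fall outside the limits (with a warning) instead of merely zooming, whereas `coord_cartesian()` zooms without discarding data. The sketch below illustrates the difference with four hypothetical points; two of them use the dataset maxima reported in the summary (15.9 for fixed.acidity and 1.0 for citric.acid) paired with made-up values for the other coordinate. The ggplot2 part runs only if the package is installed.

```r
# Limits used in the scatter plot above.
x_limits <- c(0, 14)
y_limits <- c(0, 0.75)

# Hypothetical points: the first two lie inside the limits, the last two
# use a reported maximum (15.9 or 1.0) paired with a made-up value.
fixed_acidity <- c(7.4, 11.2, 15.9, 8.0)
citric_acid   <- c(0.00, 0.56, 0.40, 1.00)

# Points that xlim()/ylim() would keep; the rest would be dropped.
kept <- fixed_acidity >= x_limits[1] & fixed_acidity <= x_limits[2] &
  citric_acid >= y_limits[1] & citric_acid <= y_limits[2]
print(sum(!kept))

# coord_cartesian() zooms to the same window without dropping any rows.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  pts <- data.frame(fixed.acidity = fixed_acidity, citric.acid = citric_acid)
  p <- ggplot(pts, aes(x = fixed.acidity, y = citric.acid)) +
    geom_point(alpha = 1/2) +
    coord_cartesian(xlim = x_limits, ylim = y_limits)
}
```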

ggplot(wineQualityReds, aes(x = volatile.acidity, y = residual.sugar)) +
  geom_point(alpha = 1/2) +
  xlim(0.0, 1.3) +
  ylim(0, 12)

There appears to be no relationship between volatile.acidity and residual.sugar (r = 0.00).

ggplot(wineQualityReds, aes(x = fixed.acidity, y = chlorides)) +
  geom_point(alpha = 1/2) +
  ylim(0.0, 0.2)

There appears to be a negligible relationship between fixed.acidity and chlorides (r = 0.09).

There appears to be a strong negative relationship between fixed.acidity and pH (r = -0.68).

ggplot(wineQualityReds, aes(x = sulphates, y = pH)) +
  geom_point(alpha = 1/2) +
  xlim(0.0, 1.4)

There appears to be a weak negative relationship between sulphates and pH (r = -0.20).

There appears to be a moderate positive relationship between density and citric.acid (r = 0.36).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Only one pair of variables has a correlation that rounds to exactly zero: volatile.acidity and residual.sugar.

In general, the positive correlations with quality are negligible to weak; alcohol has the strongest positive relationship and residual.sugar has the weakest.

In general, the negative correlations with quality are negligible to moderate; volatile.acidity has the strongest negative relationship and free.sulfur.dioxide has the weakest.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

No, I didn’t.

What was the strongest relationship you found?

alcohol has the strongest positive relationship with quality.

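The pairs of variables with the strongest positive relationships are the following (each with r = 0.67):

1- fixed.acidity and citric.acid
2- fixed.acidity and density
3- free.sulfur.dioxide and total.sulfur.dioxide

The strongest negative relationship is between fixed.acidity and pH (r = -0.68).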

Multivariate Plots Section

The above plot shows the relationship between three variables: alcohol, pH and quality. Higher levels of alcohol are associated with higher quality, while higher pH is associated with lower quality. So, wine tends to be rated better as alcohol increases and pH decreases.

ggplot(data = wineQualityReds, aes(alcohol, sulphates, color = as.factor(quality))) +
  geom_point() +
  ylim(0.0, 1.4) +
  theme_dark()

The above plot shows the relationship between three variables: sulphates, alcohol and quality. Higher levels of sulphates are associated with higher quality, and higher levels of alcohol are also associated with higher quality. So, wine tends to be rated better when both alcohol and sulphates increase.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The main observation I made from the above multivariate plots is that better wine has higher levels of alcohol and sulphates but a lower pH.
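
One way to check an observation like this numerically is to compare group means across quality levels. The sketch below shows the mechanics with `aggregate()`, using only the first ten rows from the `str()` output; `wine_head` and `group_means` are hypothetical names. Ten rows are too few to reproduce the trend seen in the plots, so this illustrates the approach rather than the result.

```r
# First ten rows of four columns, copied from the str() output.
wine_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  pH        = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean of each feature within each quality level.
group_means <- aggregate(cbind(alcohol, sulphates, pH) ~ quality,
                         data = wine_head, FUN = mean)
print(group_means)
```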

Were there any interesting or surprising interactions between features?

I studied the relationships between only a few variables, and I didn’t notice any surprising interactions between them.

Final Plots and Summary

This section displays three plots, each with its own description. I chose the first plot from each of the three sections: the univariate plots section, the bivariate plots section and the multivariate plots section.

Plot One

Description One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed.acidity is right-skewed. The minimum value is 4.60 and the maximum is 15.90, so the range of the data is 11.3. The main peak is at approximately 7. There is a small gap between 15 and 15.5. The middle half of the observations falls between 7.10 and 9.20.

Plot Two

Description Two

There appears to be a strong positive relationship between fixed.acidity and citric.acid (r = 0.67), with a few outliers.

Plot Three

Description Three

The above plot shows the relationship between three variables: alcohol, pH and quality. Higher levels of alcohol are associated with higher quality, while higher pH is associated with lower quality. So, wine tends to be rated better as alcohol increases and pH decreases.

Reflection

The dataset I worked with for this project contains 1599 observations of 12 main variables. It is already tidy, so there was no need for cleaning at the beginning of the project.

The project was very interesting. The easiest but longest part was plotting a histogram and summarizing each feature. The main difficulty I faced was the multivariate analysis, because it was new to me; I had not done it before when using Python.

I studied the relationships between many variables and noticed that there is no linear relationship between volatile.acidity and residual.sugar. The positive correlations with quality are negligible to weak, except for alcohol, which has the strongest positive relationship with quality. residual.sugar has the weakest positive relationship with quality. volatile.acidity has the strongest negative relationship with quality, although in general the negative correlations with quality are negligible to moderate. free.sulfur.dioxide has the weakest negative relationship with quality.

In the future, I would like to invest more time in studying the relationships between the variables I didn’t explore in this project and in making more multivariate plots.